Use caller CUDA stream for D2H and H2D copies (#20498)#20498
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20498
Note: Links to docs will display an error until the docs builds have been completed.
|
|
@Conarnar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109590531. |
This PR needs a
|
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync` and synchronize the stream before returning — preserving the blocking API contract while allowing work to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
3d8da75 to
07765c3
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync` and synchronize the stream before returning — preserving the blocking API contract while allowing work to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
| cudaError_t err = cudaSuccess; | ||
| const auto caller_stream = executorch::extension::cuda::getCallerStream(); | ||
| if (caller_stream) { | ||
| err = cudaMemcpyAsync( | ||
| dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream); | ||
| } else { | ||
| err = cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice); | ||
| } |
| // TODO: validate caller stream device matches index. | ||
| // For now assert single-GPU case. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "CudaAllocator::copy_host_to_device only supports device 0, got %d", | ||
| static_cast<int>(index)); |
| // TODO: validate caller stream device matches index. | ||
| // For now assert single-GPU case. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "CudaAllocator::copy_device_to_host only supports device 0, got %d", | ||
| static_cast<int>(index)); |
| cudaStream_t s; | ||
| ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess); | ||
| executorch::extension::cuda::CallerStreamGuard g(s); | ||
|
|
||
| CudaAllocator& a = CudaAllocator::instance(); | ||
| auto res = a.allocate(256, 0); | ||
| ASSERT_TRUE(res.ok()); | ||
| void* d = res.get(); | ||
| std::vector<uint8_t> h(256, 7); | ||
| // should take async branch internally, still return Ok | ||
| EXPECT_EQ(a.copy_host_to_device(d, h.data(), 256, 0), Error::Ok); | ||
| a.deallocate(d, 0); | ||
| cudaStreamDestroy(s); |
| cudaStream_t s; | ||
| ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess); | ||
| executorch::extension::cuda::CallerStreamGuard g(s); | ||
|
|
||
| CudaAllocator& a = CudaAllocator::instance(); | ||
| auto res = a.allocate(256, 0); | ||
| ASSERT_TRUE(res.ok()); | ||
| void* d = res.get(); | ||
| std::vector<uint8_t> h_src(256, 5), h_dst(256, 0); | ||
| ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok); | ||
| EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok); | ||
| EXPECT_EQ(h_src, h_dst); | ||
|
|
||
| a.deallocate(d, 0); | ||
| cudaStreamDestroy(s); |
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
07765c3 to
b316b71
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
b316b71 to
98081dc
Compare
| // TODO: validate caller stream device matches index. | ||
| // For now assert index is -1 or 0. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "CudaAllocator::copy_host_to_device only supports device 0 or -1 (current), got %d", | ||
| static_cast<int>(index)); |
| // TODO: validate caller stream device matches index. | ||
| // For now assert index is -1 or 0. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "CudaAllocator::copy_device_to_host only supports device 0 or -1 (current), got %d", | ||
| static_cast<int>(index)); |
| cudaError_t err = cudaSuccess; | ||
| const auto caller_stream = executorch::extension::cuda::getCallerStream(); | ||
| if (caller_stream) { | ||
| err = cudaMemcpyAsync( | ||
| dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream); | ||
| // We don't synchronize the stream here because the caller is expected to |
| if (caller_stream) { | ||
| err = cudaMemcpyAsync( | ||
| dst, src, nbytes, cudaMemcpyDeviceToHost, *caller_stream); | ||
| if (err == cudaSuccess) { | ||
| err = cudaStreamSynchronize(*caller_stream); | ||
| } |
| ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok); | ||
| EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok); | ||
| EXPECT_EQ(h_src, h_dst); |
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
98081dc to
1e001a5
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
1e001a5 to
4e57af7
Compare
| #include <cuda_runtime.h> | ||
|
|
||
| #include <executorch/extension/cuda/caller_stream.h> | ||
| #include <executorch/runtime/platform/log.h> |
| // TODO: validate caller stream device matches index. | ||
| // For now assert index is -1 or 0. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "CudaAllocator::copy_host_to_device only supports device 0 or -1 (current), got %d", | ||
| static_cast<int>(index)); |
| // TODO: validate caller stream device matches index. | ||
| // For now assert index is -1 or 0. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "CudaAllocator::copy_device_to_host only supports device 0 or -1 (current), got %d", | ||
| static_cast<int>(index)); |
| if (caller_stream) { | ||
| err = cudaMemcpyAsync( | ||
| dst, src, nbytes, cudaMemcpyDeviceToHost, *caller_stream); | ||
| if (err == cudaSuccess) { | ||
| err = cudaStreamSynchronize(*caller_stream); | ||
| } | ||
| } else { |
| std::vector<uint8_t> h(256, 7); | ||
| // should take async branch internally, still return Ok | ||
| EXPECT_EQ(a.copy_host_to_device(d, h.data(), 256, 0), Error::Ok); | ||
| ASSERT_EQ(cudaStreamSynchronize(s), cudaSuccess); | ||
| a.deallocate(d, 0); |
| std::vector<uint8_t> h_src(N, 42); | ||
| EXPECT_EQ(a.copy_host_to_device(dptr, h_src.data(), N, 0), Error::Ok); | ||
|
|
||
| a.deallocate(dptr, 0); |
|
@claude review the PR plz |
|
Claude finished @Gasoonjia's task in 2m 45s —— View job PR Review: Use caller CUDA stream for D2H and H2D copiesThanks @Conarnar — solid direction, and the null/zero-byte hardening + new tests are welcome. I read all four changed files plus the surrounding build files and the 🔴 1. H2D async path never synchronizes → breaks the blocking contract (use-after-free risk)
if (caller_stream) {
err = cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
// We don't synchronize the stream here because the caller is expected to
// synchronize the stream.
} else {
err = cudaMemcpy(...); // blocking
}The base API in Error err = allocator->copy_host_to_device(out.mutable_data_ptr(), self.const_data_ptr(), nbytes, device_index);
// ... returns out; the kernel has no access to the caller streamSo "the caller is expected to synchronize" doesn't hold here — the op kernel can't. The concrete hazard: Pick one and make both consistent:
Given the current callsite, (a) is the right scope for this PR. Fix this → 🔴 2. Likely link/load break on non-MSVC shared builds
On non-MSVC, This PR wired the Buck dep ( 🟠 3. The "single-GPU" guard doesn't actually enforce single-GPU
ET_CHECK_OR_RETURN_ERROR(index == -1 || index == 0, InvalidArgument,
"...only supports device 0 or -1 (current), got %d", ...);
🟡 4. Tests don't validate H2D actually copied data
🟡 5. D2H caller-stream test relies on allocator-internal sync
Minor / nits
SummaryThe async-stream plumbing is the right idea, but I'd block on #1 (semantics/safety) and #2 (build) before merge; #3 is a quick correctness tightening that matches the stated intent; #4/#5 harden the new tests. Nice work on the validation and test coverage otherwise. |
| } | ||
| // TODO: validate caller stream device matches index. | ||
| // For now assert index is -1 or 0. | ||
| ET_CHECK_OR_RETURN_ERROR( |
There was a problem hiding this comment.
The index==-1 || index==0 guard is too broad. The reason to restrict the device is that a green-context caller stream is tied to its creation device. The plain cudaMemcpy path has no caller stream and used to handle any index via setDevice/restore, and tensor_ptr.cpp passes arbitrary indices and aborts on a non-Ok result, so this guard crashes multi-GPU callers that worked before. Suggest checking the device only in the caller-stream branch and letting the no-stream path copy on any index>=-1 like before. Same applies to copy_device_to_host (line 207).
| void* dst, | ||
| const void* src, | ||
| size_t nbytes, | ||
| DeviceIndex index) { |
There was a problem hiding this comment.
copy_host_to_device and copy_device_to_host are almost identical now (null checks, nbytes==0, the index guard, device save/restore, the stream branch), differing only in the copy kind and the sync. Could fold them into one private helper that takes the cudaMemcpyKind to keep the shared logic in one place.
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
4e57af7 to
d19c2d3
Compare
|
|
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
d19c2d3 to
665a2f8
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
665a2f8 to
ae9a55e
Compare
| if (caller_stream) { | ||
| err = cudaMemcpyAsync(dst, src, nbytes, kind, *caller_stream); | ||
| if (err == cudaSuccess && kind == cudaMemcpyDeviceToHost) { | ||
| err = cudaStreamSynchronize(*caller_stream); | ||
| } | ||
| } else { | ||
| err = cudaMemcpy(dst, src, nbytes, kind); | ||
| } |
| const auto caller_stream = executorch::extension::cuda::getCallerStream(); | ||
| if (caller_stream) { | ||
| // TODO: validate caller stream device matches index. | ||
| // For now assert index is -1 or 0. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "%s: with caller stream, only supports device 0 or -1 (current), got %d", | ||
| method, | ||
| static_cast<int>(index)); | ||
| } |
Summary:
CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via
getCallerStream()),copy_host_to_deviceandcopy_device_to_hostusecudaMemcpyAsync. When no caller stream is set, the synchronouscudaMemcpypath is used as before.Additionally:
dst/srcreturnError::InvalidArgumentinstead of aborting incudaMemcpy, and zero-byte copies returnError::Okearly.//executorch/extension/cuda:caller_streamdependency in TARGETS.extension_cudadependencies to CMakeLists.txt.test_cuda_allocatorwith coverage for sync/async paths and error handling.Reviewed By: Gasoonjia
Differential Revision: D109590531